Abstract

Cultivated strawberry (Fragaria x ananassa) is octoploid and shows allogamous behaviour. The present study aims at dissecting this octoploid genome through comparison with its wild relatives, F. iinumae, F. nipponica, F. nubicola, and F. orientalis, by de novo whole-genome sequencing on the Illumina and Roche 454 platforms. The total length of the assembled Illumina genome sequences was 698 Mb for F. x ananassa and ~200 Mb each for the four wild species. Subsequently, a virtual reference genome termed FANhybrid_r1.2 was constructed by integrating the sequences of the four homoeologous subgenomes of F. x ananassa, from which heterozygous regions in the Roche 454 and Illumina genome sequences were eliminated. The total length of FANhybrid_r1.2 thus created was 173.2 Mb, with an N50 length of 5137 bp. The Illumina-assembled genome sequences of F. x ananassa and the four wild species were then mapped onto the reference genome, along with the previously published F. vesca genome sequence, to establish the subgenomic structure of F. x ananassa. The strategy adopted in this study proved successful in dissecting the genome of octoploid F. x ananassa and appears promising for the analysis of other polyploid plant species.

Key words: Fragaria x ananassa; wild Fragaria species; genome sequence assembly; comparative analysis; 
polyploidy 



1. Introduction

The cultivated strawberry (Fragaria x ananassa) is a globally consumed crop species that is grown around the world: in the USA (37.5%), Europe (36.7%), Asia (16.3%), Africa (8.6%), and Oceania (0.8%). 1 Fragaria x ananassa is an octoploid species (2n = 8x = 56) with an estimated genome size of 1C = 708-720 Mb. 2,3 In addition to its polyploidy, allogamous behaviour in F. x ananassa contributes further complexity to the genome structure. Until now, three genome composition models, namely AABBBBCC, 4 AAA'A'BBBB, 5 and AAA'A'BBB'B', 6 have been proposed based on cytological and genetic evidence. Of the three models, the most recently proposed, AAA'A'BBB'B', 6 is considered the most probable candidate, since several studies have reported the disomic inheritance of large numbers of DNA markers, 7-9 which suggests an allopolyploid genome composition in F. x ananassa.

The genus Fragaria belongs to the family Rosaceae and comprises one cultivated (F. x ananassa) and 24 wild species, including 13 diploids, 5 tetraploids, 1 hexaploid, 4 octoploids, and 1 decaploid. 10,11 The geographic origins of the wild species are distributed throughout Eurasia, North and South America, and Japan. Fragaria x ananassa originated in the 1700s from a natural hybridization between two octoploids, F. virginiana and F. chiloensis. 12 However, the history of evolution from diploid to octoploid species in the genus Fragaria remains controversial. Davis et al. 3 proposed that F. vesca, F. nubicola, and F. orientalis were possible progenitors to octoploids, based on the phylogenetic analysis of internal transcribed spacer sequences of rDNA and chloroplast DNA performed by Potter et al. 13 Rousseau-Gueutin et al. 14 hypothesized that the evolutionary history was based on five subgenomic entries, X1, X2, Y1, Y2, and Z, classified according to two nuclear genes, GBSSI-2 and DHAR. They proposed that the genomes of wild octoploids consisted of Y1'Y1'Y1"Y1"ZZZZ or Y1Y1Y1Y1ZZZZ genomes. The Y1 genomes were considered to be derived from two diploid species, F. vesca or F. mandshurica, via a tetraploid, F. orientalis, whereas F. iinumae was presumed to be the progenitor of the Z genome.

Fragaria vesca, which is the most plausible progenitor of F. x ananassa, was selected as the genomic reference for Fragaria. Using a fourth-generation inbred line and the Roche 454, Illumina Solexa, and Life Technologies SOLiD platforms, the whole-genome sequence of F. vesca was published in 2011. 15 According to the report, a total of 209.8 Mb were assembled into 272 scaffolds, and 34 809 candidate genes were identified by gene prediction. Later, the original v1.0 pseudomolecule assembly was updated to v1.1. 16

The F. vesca genome sequences have contributed greatly to advances in molecular genetic analysis in F. x ananassa 8,9 and are expected to assist in the identification of genes related to agriculturally important traits in F. x ananassa, such as flowering time, 17 male sterility, 18 and stress resistance. 19 However, consultation of the F. vesca genome has limitations, particularly for F. x ananassa-specific genome regions or genome regions derived from other progenitors. Therefore, whole-genome sequencing of F. x ananassa is needed for a comprehensive understanding of the genome structure of the species, in parallel with reference to the F. vesca genome.

Next-generation sequencing (NGS) technologies have revolutionized de novo whole-genome sequencing in most species. This is especially true in plant species, for which genomes are relatively large and complex in structure compared with non-plant genomes. 20,21 Up to now, whole-genome sequences have been published in more than 50 plant species, including five species in Rosaceae: F. vesca, 15 apple, 22 pear, 23 peach, 24 and Chinese plum. 25 However, to our knowledge, no reports of de novo whole-genome sequencing have been published for any polyploid species. As with strawberry, diploid progenitor species have been employed to advance the genomic understanding of other polyploid crops, such as potato, 26 cotton, 27 and banana. 28 Genome sequencing of polyploid species has been avoided, because the homoeology of subgenomes made sequence assembly difficult. To overcome this difficulty, we attempted to construct a virtual 'reference genome' in octoploid strawberry, F. x ananassa, as a first step for whole-genome sequencing of the species.

In this study, de novo whole-genome sequencing was performed in F. x ananassa using the Illumina (Illumina, Inc., CA, USA) and Roche 454 sequencing platforms (Roche Diagnostics, IN, USA). A virtual reference genome, which integrated genome sequences of homoeologous chromosomes, was constructed by eliminating heterozygous bases in the process of sequence assembly. In a previous study, macrosynteny at the chromosome level was observed between the linkage groups (LGs) in an F. x ananassa genetic map and the genome of F. vesca. 9 Therefore, we aligned the generated F. x ananassa scaffolds with the pseudomolecules of F. vesca. 15 In parallel, four wild Fragaria species, representing genetic diversity in the genus Fragaria, were selected based on simple sequence repeat (SSR) markers and were subjected to whole-genome sequencing using an Illumina platform. The assembled sequences of the wild species, along with the heterozygous F. x ananassa sequences, were mapped onto the F. x ananassa reference genome and used to elucidate the genome structures of the Fragaria species.

2. Materials and methods 

2.1. Plant materials 

A Japanese variety, 'Reikou', and its progeny (S1) were subjected to genome sequencing as representatives of F. x ananassa. 'Reikou' was bred in the Chiba Prefectural Agriculture and Forestry Research Center. Like other strawberry varieties, 'Reikou' maintains heterozygosity in the genome. The S1 progenies were sequenced for SNP discovery within 'Reikou' for further analysis (data not shown in this study). A total of 20 wild Fragaria species with a diverse range of polyploidy were used in phylogenetic analyses with SSR markers (Supplementary Table S1). Along with the 20 wild species, the following four Japanese F. x ananassa varieties were subjected to the phylogenetic analysis: 'Reikou', 'Hokowase', 'Sachinoka', and 'Akihime'. Whole-genome sequencing was subsequently performed for four wild species, F. iinumae, F. nipponica, F. nubicola, and F. orientalis. The genomic DNA was extracted from young leaves using a DNeasy Plant Mini Kit (Qiagen, Inc., CA, USA). DNA quantification and quality checks were performed using a Nanodrop ND1000 spectrophotometer (Nanodrop Technologies, DE, USA) and 0.8% agarose gel electrophoresis, respectively.

2.2. Phylogenetic analysis 

Genomic polymorphisms from 24 Fragaria accessions (listed in Supplementary Table S1) were analysed using 632 F. x ananassa and F. vesca expressed sequence tag-derived SSR markers that randomly mapped onto the 28 LGs of the consensus genetic map of F. x ananassa 9 (Supplementary Table S2). PCR was performed as described in Isobe et al. 9 The PCR products were separated using an ABI 3730xl fluorescent fragment analyser (Applied Biosystems, CA, USA). Polymorphisms were investigated using the GeneMapper software (Applied Biosystems). The genetic distances and Jaccard's similarity coefficients of all combinations of any two samples were calculated from the genotypic data using the GGT2 software. 29 The hierarchical cluster analysis with the multiscale bootstrap was performed for 1000 replications with Ward's method by the pvclust package 30 for the R software (http://www.r-project.org/).
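As an illustration of the similarity measure named above, the following minimal sketch computes Jaccard's coefficient from presence/absence scores of allelic peaks. It is not the GGT2 implementation: the function names, the 0/1 encoding of peaks, and the use of 1 minus the coefficient as a distance are assumptions made for the example.

```python
def jaccard_similarity(peaks_a, peaks_b):
    """Jaccard coefficient of two presence/absence (1/0) peak vectors."""
    if len(peaks_a) != len(peaks_b):
        raise ValueError("both samples must be scored for the same peaks")
    shared = sum(1 for a, b in zip(peaks_a, peaks_b) if a and b)
    present = sum(1 for a, b in zip(peaks_a, peaks_b) if a or b)
    # Two samples with no peaks at all are treated as identical (assumption).
    return shared / present if present else 1.0


def pairwise_distances(samples):
    """Return {(name1, name2): 1 - Jaccard} for every pair of samples.

    `samples` maps a sample name to its presence/absence vector.
    Using 1 - Jaccard as the distance is an assumption of this sketch.
    """
    names = sorted(samples)
    return {
        (n1, n2): 1.0 - jaccard_similarity(samples[n1], samples[n2])
        for i, n1 in enumerate(names)
        for n2 in names[i + 1:]
    }
```

A distance table of this kind is the sort of input a hierarchical clustering routine would take; the sample names and vectors used with it here would be hypothetical.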

2.3. Genome sequencing 

Whole-genome shotgun sequencing was carried out using the Roche 454 GS FLX+ and Illumina GAIIx/HiSeq 1000 platforms (Roche Diagnostics; Illumina, Inc.; Fig. 1, Supplementary Table S3). Total cellular DNA of the 'Reikou' variety of F. x ananassa was used for the construction of single-end (SE) and paired-end (PE) libraries for the Roche 454 sequencing platform according to the instructions provided by the manufacturer. Insert sizes of the PE libraries were 3, 5, 8, and 20 kb.

Illumina PE and mate-pair (MP) libraries were constructed from total cellular DNAs of 'Reikou', five S1 progenies of 'Reikou', and four wild species, F. iinumae, F. nipponica, F. nubicola, and F. orientalis, according to the instructions provided by the manufacturer (Fig. 2 and Supplementary Table S3). The expected insert sizes and read lengths of the libraries ranged from 290 bp to 2 kb, and 51-101 bp, respectively. Of the developed Illumina libraries, two 'Reikou' PE libraries (with insert sizes of 600 and 400 bp) were subjected to genome sequencing by the Illumina GAIIx, whereas other Illumina libraries were analysed by the Illumina HiSeq 1000. The mixed sequences generated from the five progenies were regarded as equivalent to the parental 'Reikou' genome and were assembled together with sequence data from the 'Reikou' libraries.

Figure 1. The strategy and status of sequencing and assembly of the reference genome of F. x ananassa (FANhybrid_r1.2). PE, SE, and MP represent paired-end, single-end, and mate-pair reads, respectively. The FANhybrid_r1.2 sequences (a black box) consist of the 454 scaffolds gap-closed with Illumina reads and Illumina-specific assembled sequences (gray boxes). Flow chart contents: 454 reads (Roche 454 GS FLX+; total 2.6 Gb) were assembled with Newbler 2.7 and contaminating sequences were excluded. Illumina reads (total 277 Gb) yielded Illumina-specific scaffolds and contigs (total length 213.5 Mb), which were divided into 300-bp sequences with 200-bp overlaps, re-assembled with Newbler 2.7 (heterozygotic mode), merged based on the overlapping sequences when the divided sequences were not joined, and cleaned of probable contaminating sequences, giving scaffolds, unassembled contigs, and singlets (total size 72.0 Mb; N50 411 bp). Final assembly: FANhybrid_r1.2 (total 173.2 Mb; N50 5137 bp).

Figure 2. The strategy and status of Illumina sequence assembly of the genomes of F. x ananassa (FAN_r1.1) and four wild species, F. iinumae (FII_r1.1), F. nipponica (FNI_r1.1), F. nubicola (FNU_r1.1), and F. orientalis (FOR_r1.1), based on Illumina reads. Flow chart contents: the Illumina reads of each species (277 Gb for F. x ananassa) were assembled with SOAPdenovo v1.05 (Kmer = 75) and GapCloser 1.10 (p = 31), and sequences shorter than 300 bp and probable contaminating sequences were excluded. Resulting assemblies: FII_r1.1 (total 199.6 Mb; N50 3309 bp), FNI_r1.1 (total 206.4 Mb; N50 1275 bp), FNU_r1.1 (total 203.7 Mb; N50 1291 bp), FOR_r1.1 (total 214.2 Mb; N50 722 bp), and FAN_r1.1 (total 697.8 Mb; N50 2201 bp).

2.4. Quality control of Illumina reads

The Illumina reads were preprocessed for quality control using the FastX-toolkit, ver. 0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/). After quality filtering, the sequences that included an N or were trimmed according to the following criteria were excluded from further analysis: (i) bases with quality value <10; (ii) probable artefact reads; and (iii) adaptor sequences consisting of more than five bases at the 3' terminal. The genome sizes of F. x ananassa and the four wild relatives were estimated, based on the k-mer = 17 frequency of the Illumina reads, by using Jellyfish ver. 1.1.6. 31
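The idea behind a k-mer based genome size estimate can be sketched as follows. This is a simplified illustration rather than the Jellyfish workflow used in the study: it counts k-mers in memory, treats k-mers seen fewer than `min_depth` times as sequencing errors, and divides the number of remaining k-mer occurrences by the depth at the histogram peak. The function names and the `min_depth` threshold are assumptions, and the sketch does not count canonical (strand-collapsed) k-mers or correct for ploidy.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer (forward strand only) in an iterable of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers seen at that depth."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size as (k-mer occurrences) / (peak depth).

    Depths below `min_depth` are assumed to be sequencing errors and ignored.
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth
```

For real read sets a dedicated counter such as Jellyfish is needed, since holding billions of k-mers in a Python dictionary is not practical.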

2.5. Assembly of the F. x ananassa reference genome

The 454 reads were assembled using Newbler 2.7 (Roche Diagnostics) in a heterozygotic mode (Fig. 1). The heterozygous bases between the homoeologous or heterozygous genomes were eliminated by the overlap layout consensus method, with sequence identity of 90%. The gaps on the scaffolds were closed by GapCloser 1.10 (p = 31) (http://soap.genomics.org.cn/soapdenovo.html) using the Illumina reads. In parallel, all the F. x ananassa Illumina reads were assembled using SOAPdenovo v1.05 32 with k-mer = 75. After the assembly of the Illumina reads, the gaps on the scaffolds were closed by GapCloser 1.10 (p = 31). Illumina-specific scaffolds and contigs, when compared with the 454 scaffolds, were selected by conducting BLASTN 33 searches with an E-value cut-off of 1E-10. To eliminate heterozygous bases, the Illumina-specific scaffolds and contigs were re-assembled according to the approach employed to assemble the fire ant genome. 34 First, the sequences of Illumina-specific scaffolds and unassembled contigs were divided into 300-bp lengths with 200-bp overlaps. The divided sequences were then re-assembled using Newbler 2.7 in the heterozygotic mode. In cases when the divided sequences were not joined by Newbler 2.7, they were merged based on the overlapping sequences.
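The division step described above (300-bp pieces with 200-bp overlaps) can be sketched as a sliding window with a 100-bp step. The function name and the handling of the final piece are assumptions: here a last window is anchored at the end of the sequence so that no bases are dropped, which may differ from what the authors did.

```python
def split_with_overlap(seq, window=300, overlap=200):
    """Cut `seq` into `window`-bp pieces whose neighbours share `overlap` bp."""
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    if len(seq) <= window:
        return [seq]  # sequences no longer than one window are left whole
    pieces = [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]
    # Assumption: cover any leftover tail with one extra end-anchored window.
    if (len(seq) - window) % step != 0:
        pieces.append(seq[-window:])
    return pieces
```

Because every piece overlaps its neighbour by two thirds of its length, an overlap-based assembler can rejoin the pieces while collapsing positions that differ between near-identical copies.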

Probable contaminating sequences on the 454 scaffolds and the Illumina-specific sequences were identified and removed using BLASTN searches against the chloroplast genome sequence of F. vesca (accession number: NC_015206.1), mitochondrial genome sequence of Arabidopsis thaliana (accession number: NC_001284.2), bacterial genome sequences registered in NCBI (http://www.ncbi.nlm.nih.gov), and vector sequences from UniVec (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) with an E-value cut-off of 1E-10 and length coverage of >10%. The cleaned Illumina scaffolds, contigs, and singlets were integrated with the 454 scaffolds and designated as FANhybrid_r1.2.



2.6. Assembly of Illumina reads from F. x ananassa and wild species

The assembly of Illumina reads from F. x ananassa was performed along with that of reads from the four wild species, F. iinumae, F. nipponica, F. nubicola, and F. orientalis (Fig. 2). The Illumina reads in each species were assembled using SOAPdenovo v1.05 with k-mer = 75. The gaps on the scaffolds in each species were closed by GapCloser 1.10 (p = 31). Probable contaminating sequences were identified and removed as described in the above section. The cleaned scaffolds and contigs >300 bp in length were designated FAN_r1.1 (F. x ananassa), FII_r1.1 (F. iinumae), FNI_r1.1 (F. nipponica), FNU_r1.1 (F. nubicola), and FOR_r1.1 (F. orientalis).

2.7. Repetitive sequences

Probable repetitive sequences in the assembled genomes of the Fragaria species were identified by RepeatScout 35 with default parameters based on the discovery of repetitive substrings in the sequences. In parallel, similarity searches and repeat masking were performed by RepeatMasker (http://www.repeatmasker.org) on the assembled genome sequences of the Fragaria species against known repetitive sequences registered in RepBase. 36 To identify novel repetitive sequences, the assembled genome sequences, with known repetitive sequences masked, were subjected to a second-round similarity search by RepeatMasker against probable repetitive sequences detected by RepeatScout. The SSR motifs with equal or greater numbers of defined repeats were searched for in exons, introns, and entire regions on the assembled genomes using SciRoKo with misa mode. 37 The defined numbers of repeats in mono-, di-, tri-, tetra-, penta-, and hexa-motifs were 12, 6, 7, 5, 5, and 5, respectively.
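The repeat thresholds listed above can be illustrated with a small regular-expression search. This is a sketch of a perfect-repeat SSR scan, not a reimplementation of SciRoKo: the function name is hypothetical, only uninterrupted repeats are found, and motifs that are themselves repeats of a shorter unit (for example AA or ATAT) are skipped so that each SSR is reported once under its shortest motif.

```python
import re

# Minimum repeat counts per motif length (mono- to hexa-), as given in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 7, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs made of a shorter repeated unit, e.g. "ATAT" = "AT" x 2.
            if any(
                motif == motif[:d] * (unit_len // d)
                for d in range(1, unit_len)
                if unit_len % d == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)
```

For example, a run of six AT units meets the di-nucleotide threshold, whereas a run of five does not.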

2.8. Assignment of RNA-encoding genes

Transfer RNA genes were predicted by using tRNAscan-SE ver. 1.23 38 with default parameters. Ribosomal RNA genes were identified by BLAT searches in BLAST output format, with an E-value cut-off of 1E-10 and length cut-off of 70 bp, using 5.8S rRNA (accession number: X15589.1), 18S rRNA (accession number: X15590.1), and 26S rRNA (accession number: X58118.1) in F. x ananassa.

2.9. Gene prediction and annotation 

The FANhybrid_r1.2 and the five Illumina-assembled genomes were subjected to gene prediction and modelling using Augustus 2.7 39 with a training set from A. thaliana and the following parameters: gene model = partial; protein, introns, start, stop, cds, coding seq, gff3, and UTR = on; and alternatives-from-evidence and alternatives-from-sampling = true. The genes related to transposable elements (TEs) were predicted according to the domain and product names of genes identified by similarity search in two databases, the InterPro database 40 and the NCBI's non-redundant protein sequence (NR) database (http://www.ncbi.nlm.nih.gov). Similarity searches in the former and latter databases were performed by InterProScan 41 and BLASTX 33 with an E-value cut-off of 1.0 and 1E-10, respectively. The Gypsy-type TEs were further identified by a search using the hmmscan module in HMMER 3.0 42 against the hidden Markov model in the Gypsy Database 2.0. 43

2.10. Comparative analysis between F. x ananassa and wild species

BLAT searches 44 were performed for the sequences in FANhybrid_r1.2 against the pseudomolecules of F. vesca (v1.1) with -minScore = 100 and -minIdentity = 95. Subsequently, the sequences in FANhybrid_r1.2 were aligned with the genomic sequences of F. vesca based on the results of the BLAT searches. Furthermore, the hit regions were selected according to the following criteria: (match score + number of gaps in query) x 100/query length (%) > 80; hit region in query x 100/query length (%) > 80; match score/query length > 25; and 80 < (hit region in subject x 100/query length) < 120. The best alignments in each query sequence were approximately identified using the pslReps program 44 in BLAT with -minCover = 0.20. Validation and modification of the results were performed manually by eye. Furthermore, similarity searches were performed for the five Illumina-assembled genome sequences against FANhybrid_r1.2 by BLAT, and these sequences were aligned with the genome sequences of FANhybrid_r1.2 using the pslReps program as described above. In parallel, the BLASTN searches with an E-value cut-off of 1E-10 were performed between each possible pair of assembled genomes to investigate sequence similarity across the assembled genome sequences.
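The four hit-selection criteria given in this section can be written as a single filter. The sketch below is only an illustration under stated assumptions: the `Hit` record and its field names are hypothetical and do not correspond to named BLAT output columns, and the third criterion (match score/query length > 25) is interpreted as a percentage, since a plain ratio of match score to query length could not exceed 25.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one alignment between a query and a subject."""
    match_score: int    # number of matching bases
    query_gaps: int     # number of gaps in the query
    query_span: int     # length of the hit region on the query
    subject_span: int   # length of the hit region on the subject
    query_length: int   # full length of the query sequence


def passes_filters(hit):
    """Apply the four selection criteria listed in the text to one hit."""
    q = hit.query_length
    matched_pct = (hit.match_score + hit.query_gaps) * 100 / q
    query_cov_pct = hit.query_span * 100 / q
    match_pct = hit.match_score * 100 / q  # assumption: criterion is a percentage
    subject_ratio_pct = hit.subject_span * 100 / q
    return (
        matched_pct > 80
        and query_cov_pct > 80
        and match_pct > 25
        and 80 < subject_ratio_pct < 120
    )
```

The last criterion rejects hits whose subject region is much shorter or longer than the query, which would indicate a large insertion or deletion between the two sequences.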

BLAT searches were also performed for the five assembled genomes against the 34 809 candidate genes estimated on the F. vesca genome 15 with the same parameters as described above. In addition, a total of 104 annotated candidate genes in F. vesca 15 were selected for further investigation of the degree of duplication in the assembled Fragaria genome sequences. Similarity searches between these 104 genes and the assembled sequences were performed by the FASTA36 program, 45 based on a cut-off value of 90% identity. In parallel, Illumina-assembled genes predicted by Augustus 2.7 were subjected to cluster analysis with the 34 809 F. vesca candidate genes by using CD-hit v4.6.1 46 with the following parameters: c = 0.8, aS = 0.1.

3. Results 

3.1. Phylogenetic analysis of Fragaria species with SSR markers

A total of 632 SSR markers, which generated 954 loci mapped onto the consensus map of F. x ananassa, 9 were employed for phylogenetic analysis of 20 wild Fragaria species and 4 F. x ananassa varieties. The average number of observed allelic peaks per marker was 11.1, ranging from 1 to 57 in all the tested species, with an average of 1.8 in each line, ranging from 1.0 in Fragaria daltoniana to 3.2 in F. x ananassa 'Hokowase'. The genetic distances between all combinations of any two lines were calculated based on the presence or absence of 10 830 allelic peaks derived from the 632 markers. The genetic distances among the 24 lines ranged from 0.07 to 0.33 (Supplementary Table S4). On the constructed phylogenetic tree, the 24 lines were classified into four clusters (Supplementary Fig. S1A). The octoploids were clearly distinguished from Fragaria species of other ploidy (cluster C). Among the octoploids, two wild species, F. virginiana and F. chiloensis, were distinguishable from the cultivated species, F. x ananassa. The clusters A, B, and D consisted of six Fragaria species each. The relative genetic relations did not show large differences between a phylogenetic tree constructed with all samples and di-, tetra-, or octoploid-specific trees. Based on the phylogenetic analysis, the following four wild species were selected as representatives of Clusters A, B, and D for subsequent whole-genome sequencing: F. nipponica (Cluster A), F. iinumae (Cluster B), F. nubicola (Cluster D), and F. orientalis (Cluster D). Two species were selected from Cluster D, because the cluster showed the closest genetic distance to Cluster C.

Genetic distances were investigated for each of the LGs on the consensus map according to the locations of the tested markers 9 (Supplementary Table S5). The mean genetic distances between F. x ananassa 'Reikou' and the 20 wild relatives ranged from 0.24 (LG3A) to 0.34 (LG3D). In comparison, among the homoeologous groups (HGs), HG5 showed the least variation between the corresponding LGs (0.28-0.30), whereas HG3 showed the greatest variation between the corresponding LGs (0.24-0.34). Fragaria vesca and F. chinensis showed closer genetic distances to F. x ananassa than any other diploid wild species for every LG except LG5D, for which the closest diploid species was F. iinumae.

3.2. Genome sequencing and quality control 

A total of 6 211 718 reads were obtained from F. x ananassa SE and PE libraries using the Roche 454 GS FLX+ platform (Supplementary Table S3). The numbers of obtained Illumina reads in F. x ananassa, F. iinumae, F. nipponica, F. nubicola, and F. orientalis were 4 219 377 380, 1 078 824 432, 1 075 841 020, 1 074 353 610, and 1 050 815 546, respectively. After quality control, 33% of the Illumina reads were excluded from further analysis. Consequently, the total bases that were subjected to subsequent analysis were 276.6, 70.8, 72.2, 74.3, and 74.5 Gb, in F. x ananassa, F. iinumae, F. nipponica, F. nubicola, and F. orientalis, respectively.

3.3. Estimation of genome size based on the Illumina reads

The genome sizes of F. x ananassa and the four wild relatives were estimated based on the k-mer = 17 frequency of the Illumina reads (Supplementary Fig. S2). The genome sizes of the five Fragaria species were estimated as follows, based on the distributions of distinct k-mers along with consideration of the F. vesca genome size (209.8 Mb) 15 and polyploidy in each species: F. x ananassa = 692 Mb, F. iinumae = 221 Mb, F. nipponica = 208 Mb, F. nubicola = 202 Mb, and F. orientalis = 349.3 Mb.



3.4. Assembly of the F. x ananassa reference genome

All 454 reads were assembled using Newbler 2.7 in the heterozygotic mode (Fig. 1). During assembly, heterozygous bases on the 454 reads were adopted based on the comparison of depth and accuracy between the bases. After the gaps were closed by GapCloser 1.10, a total of 7598 scaffolds consisting of 101 218 723 bp with 19 520 552 N were obtained (Supplementary Table S6). The longest contig was 348 406 bp, and N50 was 46 803 bp.

In parallel, a total of 4 602 723 scaffolds and contigs consisting of 1 301 006 845 bp were generated using SOAPdenovo v1.05 and GapCloser 1.10. After the BLASTN search of the Illumina sequences against the 454 scaffolds, 1 153 521 Illumina-specific contigs were obtained, comprising 213.5 Mb. To assemble the Illumina-specific sequences using Newbler 2.7, the sequences that were longer than 300 bp were divided into 300-bp long sequences with 200-bp overlaps. As a result, a total of 1 437 769 Illumina sequences were re-assembled into 40 416 215 bp (Supplementary Table S6). In addition, 137 608 unassembled singlets were obtained, including 10 330 repeats and 44 686 outliers. The total length of Illumina-specific sequences was 72 010 849 bp, and the N50 was 411 bp. The integrated 454 scaffolds and Illumina-specific sequences were qualified as a reference genome for F. x ananassa, designated FANhybrid_r1.2. The number of sequences, total length, N50, and GC% of the FANhybrid_r1.2 were 211 588 sequences, 173 229 572 bp, 5137 bp, and 38.4%, respectively (Table 1).
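Since N50 is used throughout to summarise assembly contiguity, a minimal sketch of its usual definition may help: the length of the sequence at which the running total, taken from the longest sequence downwards, first reaches half of the total assembly length. The function name is illustrative, and assemblers can differ in small details such as whether gap bases (N) are counted.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```

Under this definition, half of all assembled bases lie in sequences at least as long as the N50 value.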

3.5. Illumina genome assembly in F. x ananassa and wild Fragaria species

A total of 276.6 Gb of F. x ananassa Illumina reads were assembled by SOAPdenovo v1.05 and GapCloser 1.10. After excluding sequences of contaminating DNA, the number of sequences and total length of the Illumina-assembled sequences were 4 366 193 sequences and 1 264 283 529 bp, respectively (Supplementary Table S7). The assembled sequences that were shorter than 300 bp in length, which accounted for 86% of the total, were excluded from further analysis. The selected sequences totalled 697 765 214 bp with 39 320 129 Ns, and were designated as FAN_r1.1. The N50 and maximum length of FAN_r1.1 were 2201 and 51 398 bp, respectively (Table 1).
Table 1. Statistics regarding the assembled genome sequences for F. x ananassa and four wild Fragaria species

                        FANhybrid_r1.2   FAN_r1.1      FII_r1.1      FNI_r1.1      FNU_r1.1      FOR_r1.1
N                       (missing)        39 320 129    20 741        14 444        13 982        14 826
Total (A + T + C + G)   153 709 017      658 445 085   199 606 904   206 400 535   203 672 594   214 169 220
GC% (GC/ATGC)           38.4             38.8          39.1          38.4          38.2          38.1

Only these rows of the table survive in full. A single value from the C row (40 883 595) also survives, but its column cannot be determined.

In parallel, de novo genome sequence assembly was performed for Illumina reads of the four wild species using SOAPdenovo v1.05 and GapCloser 1.10. As with F. x ananassa, assembled sequences of <300 bp in length were excluded from further analysis. The remaining sequences of F. iinumae, F. nipponica, F. nubicola, and F. orientalis were designated as FII_r1.1, FNI_r1.1, FNU_r1.1, and FOR_r1.1, respectively (Table 1 and Supplementary Table S7). The total length and N50 in the assembled genome sequences of the four wild species ranged from 199.6 (FII_r1.1) to 214.2 Mb (FOR_r1.1), and 722 (FOR_r1.1) to 3309 bp (FII_r1.1), respectively. GC% ranged from 38.1% (FOR_r1.1) to 39.1% (FII_r1.1).



3.6. Repetitive sequences 

The total length of repetitive sequences identified by RepeatMasker was 8 697 730 bp in FANhybrid_r1.2 (5.0% of the total length); 328 305 437 bp in FAN_r1.1 (47.1%); 63 285 682 bp in FII_r1.1 (31.7%); 52 580 277 bp in FNI_r1.1 (25.5%); 49 943 156 bp in FNU_r1.1 (24.5%); and 56 272 506 bp in FOR_r1.1 (26.3%) (Supplementary Table S8). When the same approach was applied to the F. vesca genome (v1.1), 51 223 702 bp (24.8%) comprised repeat sequences. Most of the identified repeats in the assembled genomes, except FANhybrid_r1.2, were novel repeats that were not registered in RepBase. The ratios of the novel repeats to the total lengths of Illumina-assembled genome sequences ranged from 22.6% (FNU_r1.1) to 44.9% (FAN_r1.1), whereas that in the FANhybrid_r1.2 was 3.9%. The ratio of the total length of known interspersed repeats to the total genome length ranged from 0.35% (FANhybrid_r1.2) to 0.94% (FAN_r1.1). The Class 1 long terminal repeat retrotransposons, including Copia and Gypsy types, were the most frequently observed in the known interspersed repeats.



The total numbers of identified di- to hexa-nucleotide SSRs were 22 456 (FANhybrid_r1.2), 110 251 (FAN_r1.1), 32 487 (FII_r1.1), 35 188 (FNI_r1.1), 34 556 (FNU_r1.1), and 32 340 (FOR_r1.1) (Supplementary Table S9). In each genome, di-nucleotide motifs were the most frequently observed type of SSR, ranging from 57.6% (FANhybrid_r1.2) to 69.6% (FII_r1.1). The (AT)n motif was the most abundantly observed in the Illumina-assembled genomes, whereas (AG)n was the most frequent motif in FANhybrid_r1.2.

3.7. RNA-encoding genes 

The total number of putative tRNA-encoding genes in FANhybrid_r1.2 was 300, which was less than in the F. vesca scaffolds (Supplementary Table S10). Large numbers of putative tRNA-encoding genes (1720) were identified in FAN_r1.1, while 424-514 tRNA-encoding genes were predicted in the four wild Illumina-assembled genomes. The ratios of putative tRNA genes encoding amino acids ranged from 83.7% (FOR_r1.1) to 92.4% (FII_r1.1). The total numbers of predicted rRNA-encoding genes in FANhybrid_r1.2, FAN_r1.1, FII_r1.1, FNI_r1.1, FNU_r1.1, and FOR_r1.1 were 18, 86, 37, 53, 19, and 45, respectively (Supplementary Table S11).

3.8. Gene prediction and annotation 

The total number of predicted genes in FANhybrid_r1.2 was 45 377; more genes than in F. vesca (v1.0) but less than in other assembled genome sequences (Supplementary Table S12). The total sequence length and N50 were 32 959 863 and 1290 bp, respectively. Total numbers of predicted genes in the wild Fragaria sequences ranged from 76 760 (FII_r1.1) to 99 674 (FOR_r1.1). Approximately 3.7-fold more total sequence length was identified in FAN_r1.1 compared with FANhybrid_r1.2. The N50 of the five Illumina-assembled genomes ranged from 484 (FOR_r1.1) to 948 bp (FII_r1.1).

The predicted genes were classified into TE genes
versus non-TE genes (designated as intrinsic genes,
Supplementary Table S13). The following three gene
categories were subsequently annotated: partial genes
(without start or stop codons), pseudogenes (with
in-frame stop codons), and short genes (encoding fewer
than 49 aa). No annotations were given for predicted
full-length gene sequences framed by start and stop
codons. In FANhybrid_r1.2, 41 730 sequences (92%
of the total sequences) were predicted to be intrinsic
genes, while the other 3647 (8%) were annotated as
TEs. Among the intrinsic genes in FANhybrid_r1.2,
41% did not fall into any of the three categories.
This ratio was higher than that in the
Illumina-assembled genome sequences.
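
The three annotation categories can be expressed as simple tests
on a predicted coding sequence. The following Python sketch is an
illustration only: the function name classify_gene, the standard
stop-codon set, ATG as the only start codon, and the order in
which the tests are applied are all assumptions, since the text
does not state how overlapping cases were resolved.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def classify_gene(cds, min_aa=49):
    """Assign a predicted coding sequence to one annotation category.

    Returns "partial", "pseudogene", "short", or "full-length".
    The precedence partial > pseudogene > short is an assumption.
    """
    cds = cds.upper()
    # Split into complete codons; trailing 1-2 bases are ignored.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    has_start = bool(codons) and codons[0] == "ATG"
    has_stop = bool(codons) and codons[-1] in STOP_CODONS
    if not (has_start and has_stop):
        return "partial"
    # Any stop codon before the terminal one is an in-frame stop.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "pseudogene"
    # Encoded length in amino acids excludes the terminal stop codon.
    if len(codons) - 1 < min_aa:
        return "short"
    return "full-length"


if __name__ == "__main__":
    print(classify_gene("ATG" + "GCT" * 60 + "TAA"))
```

A sequence that passes all three tests is returned as
"full-length", matching the unannotated class described above.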

3.9. Comparative analysis between the genomes 
of cultivated and wild Fragaria species 
Of the 211 588 sequences in FANhybrid_r1.2, 120 703
showed significant similarity to pseudomolecules
of F. vesca (v1.1) by BLAT search (Supplementary Table
S14). The lengths of the mapped FANhybrid_r1.2
sequences in each of the seven chromosomes ranged
from 13 922 608 (Chr1) to 26 401 015 bp (Chr6).
The FANhybrid_r1.2 sequences showing significant
similarity according to the BLAT search were well
aligned across the F. vesca pseudomolecules
(Supplementary Fig. S3). The mapped FANhybrid_r1.2
sequences covered 145 387 993 bp of the F. vesca
pseudomolecules, corresponding to 70.3% of their total
length (Supplementary Table S14). Furthermore, a
MEGABLAST 47 search (identity >90% and E-value cut-off
of 1E-50) against the wild Fragaria genome sequences
was performed for the 90 885 FANhybrid_r1.2 sequences
that did not align with the F. vesca pseudomolecules.
Of the 90 885 FANhybrid_r1.2 sequences, 22 145 showed
significant similarity against the F. vesca genome
sequences, whereas 68 740 did not (Supplementary Table
S15). A total of 39 692 FANhybrid_r1.2 sequences
(total length: 10 742 467 bp) showed no significant
similarity to any of the genome sequences, and were
concluded to be F. x ananassa-specific. In parallel,
BLASTN searches were performed between all pairwise
combinations of the assembled genomes (Supplementary
Table S16).
The total length of non-homologous sequences
in FANhybrid_r1.2 compared with FAN_r1.1 was
1.4 Mb, whereas that in FAN_r1.1 compared with
FANhybrid_r1.2 was 281.7 kb. Comparison with each of
the five wild species revealed that the ratios of
non-homologous sequences to total sequences of
FANhybrid_r1.2 and FAN_r1.1 ranged from 4.7%
(FNU_r1.1) to 5.6% (FOR_r1.1), and from 1.3%
(FNU_r1.1) to 1.8% (F. vesca v1.1), respectively. In
comparisons between the wild species, the smallest
ratio of non-homologous sequences to the total was
observed in FNU_r1.1 against FNI_r1.1 (0.66%), whereas
the largest ratio was observed in FNI_r1.1 against
FII_r1.1 (2.61%).

A BLAT search mapped 142 699 FAN_r1.1 sequences
(61.8% of the total) onto the FANhybrid_r1.2 sequence
(Supplementary Table S17). Multiple FAN_r1.1
sequences were often mapped onto the same
FANhybrid_r1.2 sequences. The numbers of single,
double, triple, and quadruple FAN_r1.1 sequences that
mapped onto the same FANhybrid_r1.2 sequences
were 64 828 (45.4% of the mapped total), 31 922
(22.4%), 13 656 (9.6%), and 6289 (4.4%), respectively.
The FANhybrid_r1.2 sequences onto which five or more
FAN_r1.1 sequences mapped tended to be classified as
repetitive sequences (Fig. 3 and Supplementary Fig. S4).
The sequence of each wild species was then mapped
onto the entire genomic sequence of FANhybrid_r1.2.
The ratios of single and double sequences from the wild
Illumina-assembled genomes that mapped onto the
same FANhybrid_r1.2 sequences ranged from 62.4%
(FOR_r1.1) to 87.3% (FII_r1.1), and from 9.5%
(FII_r1.1) to 18.5% (FOR_r1.1), respectively. The ratios
of single and double sequences from the F. vesca
genome that mapped onto the same FANhybrid_r1.2
sequences were 72.7 and 4.8%, respectively. Fragaria
vesca generated the largest number of top-hit sequences
in each chromosome, followed by F. iinumae (Fig. 3 and
Supplementary Fig. S5). A BLASTN search was also
performed for the FAN_r1.1 sequences against the wild
Fragaria genome sequences. Of the 625 966 FAN_r1.1
sequences, 5447 (0.9%) showed no significant
similarity to the wild Fragaria sequences
(Supplementary Table S18). The ratios of top-hit
sequences in FII_r1.1, FNI_r1.1, FNU_r1.1, FOR_r1.1,
and F. vesca (v1.1) to the FAN_r1.1 sequences were
39.1, 8.9, 8.1, 8.2, and 34.8%, respectively.
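
The single, double, triple, and quadruple tallies reported for the
mapping onto FANhybrid_r1.2 amount to counting how many query
sequences share each reference sequence. The Python sketch below
shows one way to compute such a profile from best-hit pairs. It
counts reference sequences per multiplicity; whether the reported
percentages count reference or query sequences is not stated in
this section, so that choice, the function name
multiplicity_profile, and the toy identifiers are assumptions.

```python
from collections import Counter


def multiplicity_profile(hits):
    """Summarise how many queries map onto each reference sequence.

    hits: iterable of (query_id, reference_id) pairs, one pair per
          mapped query sequence (its best hit).
    Returns a Counter mapping multiplicity (1 = single, 2 = double,
    ...) to the number of reference sequences with that many queries.
    """
    queries_per_reference = Counter(ref for _, ref in hits)
    return Counter(queries_per_reference.values())


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    toy_hits = [
        ("q1", "ref_A"), ("q2", "ref_A"),
        ("q3", "ref_B"),
        ("q4", "ref_C"), ("q5", "ref_C"), ("q6", "ref_C"),
    ]
    profile = multiplicity_profile(toy_hits)
    for multiplicity in sorted(profile):
        print(multiplicity, profile[multiplicity])
```

Reference sequences with a multiplicity of five or more would be
the candidates for the repetitive class mentioned above.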

3.10. Comparative analysis between the genes 
in F. vesca and the assembled sequences 
in the Fragaria species 
To investigate the degree of gene duplication in each
assembled genome, BLAT searches were performed for
the six assembled genome sequences against the 34 809
candidate genes identified on the F. vesca
genome. 15 The numbers of BLAT hit sequences in
FANhybrid_r1.2, FAN_r1.1, FII_r1.1, FNI_r1.1, FNU_r1.1,
and FOR_r1.1 were 27 718 (79.6% of the 34 809
candidate genes), 32 120 (92.3%), 27 560 (79.2%),
29 323 (84.2%), 31 869 (91.6%), and 31 901 (91.6%),
respectively. Single hits were the most frequently
observed in all the assembled genome sequences except
FAN_r1.1 (Supplementary Fig. S6a). The distribution of
FAN_r1.1 differed from those of the other assembled
genome sequences; the peak in frequency was quite
broad, and triple hits were the most frequently
observed. Of the four wild species, the F. orientalis
(FOR_r1.1) genome showed broader peaks than the others,
whereas sharper peaks were observed in the F. iinumae
genome (FII_r1.1).
Cluster analysis was performed on the gene sequences
predicted as 'intrinsic gene' and 'intrinsic gene/partial'
(Supplementary Table S13) in the Illumina-assembled
genome sequences of F. x ananassa and the diploids
(Supplementary Fig. S7) against 34 809 F. vesca
candidate genes. The numbers of gene sequences that did
not cluster in any other species were 24 596 (14.2% of
the total sequences) in FAN_r1.1; 3989 (6.7%) in
FII_r1.1; 6351 (8.8%) in FNI_r1.1; 3399 (4.8%) in
FNU_r1.1; and 1492 (4.6%) in F. vesca. The BLAT search
results were further investigated for 104 genes that
were annotated as being associated with agriculturally
important traits in a previous study 15
(Supplementary Table S19). The numbers of genes
that had no hits in each of the assembled genomes
were as follows: 17 (FANhybrid_r1.2), 10 (FII_r1.1), 3
(FNI_r1.1), 7 (FNU_r1.1), and 6 (FOR_r1.1). All the
F. vesca genes showed hits to the FAN_r1.1 sequences.
The distributions of the hit sequences were similar to
the BLAT search results against the 34 809 genes,
except that the peak in FNI_r1.1 and FOR_r1.1 was at
two hits (Supplementary Fig. S6b).

Figure 3. Position and coverage of the Illumina-assembled genome sequences mapped onto the FANhybrid_r1.2 sequence (left), and
frequency of the top-hit sequences from the five wild Fragaria species against FANhybrid_r1.2 (right). The central black and white bars
indicate the genome sequences of FANhybrid_r1.2 aligned based on the homologous sequence positions on F. vesca (v1.1) Chr1. Details
are described in Supplementary Figs S3 and S4.



4. Discussion 

The phylogenetic analysis with 632 markers revealed
that F. vesca was the most closely related diploid
species to F. x ananassa. The constructed phylogenetic
tree suggested that F. vesca and F. nubicola were
genetically closer to F. x ananassa than F. iinumae and
F. nipponica. A comparison of the genetic distances
between all pairs of the three diploids, F. iinumae,
F. nipponica, and F. nubicola, indicated that
F. iinumae and F. nipponica were more closely related
than the other pairs. This result diverged from that of
previous reports, 10,14 which distinguished the genome
of F. iinumae from those of F. nipponica and
F. nubicola. Differences in the regions investigated
and in the scale of the analyses may be responsible for
the inconsistency with the previous studies. We
concluded that the four wild species, F. iinumae,
F. nipponica, F. nubicola, and F. orientalis, were
good representatives of the genetic diversity in the
genus Fragaria, and subjected them to further genome
sequencing analysis.

The genome size of F. x ananassa was estimated as
692 Mb, based on the multiplicity of distinct k-mers
in the Illumina reads. Despite the similarity in size to
that determined in previous studies (708-720 Mb), 2,3
it was less than 4-fold the size of the F. vesca
genome (209.8 Mb). 13 Similarly, the genome
size of F. orientalis (349.3 Mb) was less than 2-fold
the size of the F. vesca genome. Therefore, we suspected
that the genome sizes of F. x ananassa and F. orientalis
were underestimated. The estimated genome sizes of
the three diploids, F. iinumae, F. nipponica, and
F. nubicola, were similar to that of F. vesca, and we
therefore concluded that the estimated values were
close to the true values. Most of the species, except
F. orientalis, showed two peaks in the distribution of
the number of distinct k-mers, and the distinct peaks
detected at larger multiplicity were employed for the
genome size estimation. Because not all the sequenced
materials were from homozygous lines, the peaks
identified at smaller multiplicity were assumed to stem
from heterozygous sequences. The peaks employed for
genome size estimations in F. x ananassa and
F. orientalis were quite broad. The broad shape of the
peaks may have resulted from the homoeologous nature of
the sequences, and this could also have led to the
underestimation of the genome sizes.
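
The estimation discussed above can be illustrated with a minimal
Python sketch, assuming the commonly used relation

    genome size ~ (total k-mer occurrences) / (multiplicity of the main peak)

and taking the peak at the larger multiplicity as the main peak,
as described in the text. The exact procedure applied to the
k-mer counts is not given in this section, so the function names,
the error cut-off (min_multiplicity), the peak filter
(min_fraction), and the toy histogram are hypothetical choices for
illustration only.

```python
def find_peaks(histogram, min_multiplicity=5, min_fraction=0.1):
    """Return multiplicities that are local maxima of the histogram.

    histogram: dict mapping multiplicity -> number of distinct k-mers.
    Multiplicities below min_multiplicity are treated as sequencing
    errors; peaks lower than min_fraction of the tallest remaining
    bin are ignored as noise.
    """
    usable = sorted(m for m in histogram if m >= min_multiplicity)
    if not usable:
        return []
    tallest = max(histogram[m] for m in usable)
    peaks = []
    for m in usable:
        here = histogram[m]
        left = histogram.get(m - 1, 0)
        right = histogram.get(m + 1, 0)
        if here > left and here >= right and here >= min_fraction * tallest:
            peaks.append(m)
    return peaks


def estimate_genome_size(histogram, min_multiplicity=5, min_fraction=0.1):
    """Estimate genome size from a k-mer multiplicity histogram."""
    peaks = find_peaks(histogram, min_multiplicity, min_fraction)
    if not peaks:
        raise ValueError("no usable peak found in the histogram")
    main_peak = max(peaks)  # the peak at the larger multiplicity
    total_occurrences = sum(
        m * n for m, n in histogram.items() if m >= min_multiplicity
    )
    return total_occurrences / main_peak


if __name__ == "__main__":
    # Toy histogram (hypothetical): error k-mers at 1-4, a smaller
    # heterozygous peak at 20 and the main peak at 40.
    toy = {1: 5000, 2: 900, 3: 100, 4: 20,
           19: 50, 20: 80, 21: 50,
           39: 200, 40: 300, 41: 200}
    print(find_peaks(toy), estimate_genome_size(toy))
```

A broad main peak, as reported for F. x ananassa and
F. orientalis, makes the choice of peak multiplicity less certain,
and the estimate varies inversely with that choice.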

The largest obstacle to genome sequence assembly of
octoploid F. x ananassa was the homoeology of the
subgenomes. In addition, the outcrossing behaviour of
F. x ananassa generates allelic heterozygosity within
pairs of homoeologous genomes. Since up to eight
heterozygous sequences exist in the F. x ananassa
genome, it was predicted that this heterozygosity would
create difficulty in assembling chromosome-specific
sequences. Therefore, we tried to construct a virtual
reference genome that could integrate genome sequences
of homoeologous or heterozygous chromosomes by
eliminating heterozygous bases. A number of assembly
programmes have been developed for sequences generated
by the NGS platforms. Their basic strategies are
mainly classified into two methods: the overlap graph
method and the de Bruijn graph method. 48,49 The
overlap graph method lays out a consensus based on
overlapping sequences, whereas the de Bruijn graph
method segments sequence reads into k-mers, and then
assembles the k-mers along paths in the graph built
from the reads. Because the overlap graph method
assumes similarities between the reads in sequence
assembly, it is able to eliminate heterozygous bases.
Newbler 2.7 and SOAPdenovo v1.05 are typical assemblers
employing the overlap graph method and the de Bruijn
graph method, respectively. Therefore, Newbler 2.7 was
used for the assembly of the reference genome
sequences, whereas SOAPdenovo v1.05 was used for the
assembly of the sequence that maintained
heterozygosity. The total length of the assembled
genome, FANhybrid_r1.2, was 173.2 Mb, which was shorter
than that of F. vesca. The N50 of the 454 scaffolds was
46 803 bp, whereas that of the Illumina scaffolds and
contigs was 411 bp. Therefore, FANhybrid_r1.2 consisted
of long 454 scaffolds and short Illumina sequences.
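
Why a de Bruijn graph tends to keep heterozygous variants as
separate paths can be seen in a toy example. The Python sketch
below builds a graph whose nodes are (k-1)-mers and whose edges
are k-mers, then lists the nodes with more than one successor.
It is not the algorithm implemented in SOAPdenovo or Newbler; the
function names, the value of k, and the two short allele sequences
are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer -> set of successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. where paths diverge."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


if __name__ == "__main__":
    # Two hypothetical alleles differing by a single base (T vs A).
    allele_a = "ATGGCGTACCTA"
    allele_b = "ATGGCGAACCTA"
    graph = de_bruijn([allele_a, allele_b], k=4)
    print(branching_nodes(graph))
```

With only one allele as input the graph is a simple chain, whereas
the single-base difference opens two parallel paths that rejoin
downstream. An assembler that does not merge such paths reports
both variants, which is consistent with the longer total length
and shorter N50 of the Illumina assemblies.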

The total length of all Illumina-assembled genome
sequences in F. x ananassa was 1264.3 Mb, which was
~2-fold the genome size estimated by the Jellyfish
program, and more than 6-fold the genome size of
F. vesca. The N50 was quite short, 406 bp. The long
total length and short N50 suggested that assembly of
the Illumina reads was obstructed by large numbers of
heterozygous sequences. Hence, contigs shorter than
300 bp were excluded from further analysis. As a
result, the total length of FAN_r1.1 was reduced to
697.8 Mb, which was close to the length estimated by
Jellyfish. The result of the similarity search against
the candidate genes in F. vesca (Supplementary Fig. S6)
indicated that high heterozygosity was maintained in
FAN_r1.1 sequences. As with FAN_r1.1, the total
lengths of Illumina-assembled genome sequences in
the three diploids were similar to the genome sizes
estimated by Jellyfish. However, the total length of
F. orientalis (FOR_r1.1) was 214.2 Mb, which was ~61%
of the genome size as estimated by Jellyfish. This
result indicated a possibility of over-exclusion of
sequences caused by the removal of sequences <300 bp
in length.

The total lengths of repeat sequences in the four wild
Illumina-assembled genome sequences were close to
that of F. vesca, whereas that of the FAN_r1.1 sequence
was ~6.4 times that of F. vesca (Supplementary Table
S8). This result implied that the total length of
repeat sequences in FAN_r1.1 was overestimated due to
the high heterozygosity in the sequences. In contrast,
the total length of the repeat sequences in
FANhybrid_r1.2 was quite short, 0.17 times that of
F. vesca. We considered that excessive integration had
occurred as a result of the elimination of heterozygous
bases through the use of the heterozygotic mode in
Newbler 2.7. Similar patterns were observed in the
numbers of tRNAs (Supplementary Table S10) and the
length and numbers of candidate genes predicted by
Augustus 2.7 (Supplementary Table S12). The frequency
of SSRs ranged from 12.7 (FANhybrid_r1.2) to 16.8
(FNI_r1.1) per 100 kb across the five assembled genome
sequences. This agreement in observed SSR frequency
suggested that SSR identification was not affected by
the heterozygosity in the assembled genome sequences.
The SSR frequency in intron regions was 2- to 3-fold
that in exon regions. Liu et al. 50 reported loss of
rDNA site numbers in octoploids by fluorescence in situ
hybridization using 5S and 25S rDNA probes. However, we
found no clear evidence of structural changes in the
genomes during the evolutionary transition from
diploids to octoploids based on the features of the
assembled genome sequences.

The BLAST analysis between the whole assembled
genome sequences (Supplementary Table S16) showed
that sequences in FAN_r1.1 covered larger regions in
the F. x ananassa genome than those in FANhybrid_r1.2.
The total length of non-homologous sequences was
only 281 769 bp in FAN_r1.1 against FANhybrid_r1.2,
whereas it was 1.4 Mb in FANhybrid_r1.2 against
FAN_r1.1. The ratio of non-homologous sequence to
total sequence did not exceed 3% in comparisons
between any pair of assembled sequences. It was
predicted that the percentages of non-homologous
sequence between the four subgenomes in F. x ananassa
(A, A', B, and B') would reflect those between the
ancestral species. Therefore, we estimated that the
percentage of non-homologous sequence between the four
subgenomes did not exceed 3%.

Based on the results of the BLAT analysis, 57% of the
sequences in FANhybrid_r1.2 aligned with F. vesca
pseudomolecules, while the other 43% of the sequences
did not. The large number of non-aligning sequences in
the BLAT analysis may be related to the strict
thresholds used (i.e. the total length of the subject
sequence must show significant similarity over 80-120%
of the length of the query sequence). In the
Illumina-assembled genomes, multiple sequences were
often mapped onto the same FANhybrid_r1.2 sequences,
and this might reflect the heterozygosity of the
corresponding regions. This multiple mapping was
observed across the genome sequence, and we concluded
that the heterozygosity was randomly distributed in
F. x ananassa and the four wild species sequenced in
this study. In comparisons of the numbers of top-hit
sequences of the five wild species, a slight difference
was observed between the results of the BLAT analysis
against FANhybrid_r1.2 and the BLASTN search against
FAN_r1.1. In the BLAT analysis, F. vesca showed the
highest numbers of top-hit sequences (Supplementary
Fig. S5), while nearly equal ratios were observed for
F. iinumae (39.1%) and F. vesca (34.8%) in the BLASTN
search (Supplementary Table S18). We considered that
the result of the BLASTN search was more accurate,
because the result of the BLAT analysis was affected by
the length of assembled sequences in the subject (wild
species) sequences. The result of the BLASTN search
suggested that the genomes of F. vesca and F. iinumae
contributed equally as progenitors of F. x ananassa,
and agreed with previous studies. 3,10,14 On
the other hand, 14.2% of the FAN_r1.1 intrinsic genes
were predicted as unique genes in F. x ananassa. These
results suggested that the genome of F. x ananassa has
diverged from the other Fragaria species during the
process of evolution.

In this study, we dissected the octoploid F. x ananassa
genome through comparisons between reference and
heterozygous genome sequences in F. x ananassa and
sequencing analysis of the genomes of wild Fragaria
relatives. To our knowledge, this is the first genomic
analysis of a polyploid species, and we expect that this
approach will be applied to the genomic analysis of
other polyploid species. On the other hand, one notable
issue remained; that is, the distinction between
homoeologous and allelic heterozygous sequences. The
small percentages of non-homologous sequence across the
assembled genomes suggested that most of the genes in
F. x ananassa have homoeologous sequences in the
genome. To distinguish subgenome-specific sequences, we
considered that introducing segregation analysis into
the genome sequence assembly would provide a more
effective approach. Therefore, S1 progenies of 'Reikou'
were subjected to further analysis. The results
obtained should contribute to progress in the genomic
and genetic analyses of the octoploid species,
F. x ananassa.

5. Database 

The genome assembly data, annotations, and gene
models of F. x ananassa and wild Fragaria species are
available at the Strawberry GARDEN
(http://strawberry-garden.kazusa.or.jp). All sequence
data (assembled sequences and genome sequence reads by
NGSs) are available through the international databases
(DDBJ/GenBank/EMBL) under the umbrella project number
PRJDB1445. The accession numbers of assembled
sequences, FANhybrid_r1.2, FAN_r1.1, FII_r1.1,
FNI_r1.1, FNU_r1.1, and FOR_r1.1 are
BATS01000001-BATS01220286, BATT01000001-BATT01714282,
BATU01000001-BATU01118549, BATV01000001-BATV01215530,
BATW01000001-BATW01211274, and
BATX01000001-BATX01323675, respectively. The
genome sequence reads obtained by Roche 454 GS
FLX+ and Illumina GAII/HiSeq are available from the
DDBJ Sequence Read Archive (DRA) under the accession
number DRA001114.